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& • Abstract 

Clustering in high- dimensional spaces is a difficult problem which is recurrent in many 
domains, for example in image analysis. The difficulty is due to the fact that high- 
,_i- ■ dimensional data usually live in different low-dimensional subspaces hidden in the orig- 

^sO ■ inal space. This paper presents a family of Gaussian mixture models designed for high- 

CD 

q | modeling. These models give rise to a clustering method based on the Expectation- 

Maximization algorithm which is called High-Dimensional Data Clustering (HDDC). 
In order to correctly fit the data, HDDC estimates the specific subspace and the intrin- 
sic dimension of each group. Our experiments on artificial and real datasets show that 
HDDC outperforms existing methods for clustering high-dimensional data. 



Key words: Model-based clustering, subspace clustering, high-dimensional data, Gaus- 
k> \ sian mixture models, parsimonious models. 



1 Introduction 

Clustering in high-dimensional spaces is a recurrent problem in many fields of 
science, for example in image analysis. Indeed, the data used in image analysis 
are often high-dimensional and this penalizes clustering methods. In this paper, 



we focus on model based approaches, see (UJ| for a review on this topic. Popular 
clustering methods are based on the Gaussian Mixture Model (GMM) [32|] and 
show a disappointing behavior when the size of the dataset is too small compared 
to the number of parameters to estimate. This well-known phenomenon is called 

Technical report 1083M, LMC-IMAG Revised on February 2, 2008 



curse of dimensionality and was introduced by Bellman [3J]. We refer to [35l. l36| 
for a theoretical study of the effect of dimension in the supervised framework. 

To avoid over fitting, it is necessary to find a balance between the number of 
parameters to estimate and the generality of the model. We propose a Gaus- 
sian mixture model which takes into account the specific subspace around which 
each cluster is located and therefore limits the number of parameters to estimate. 
The Expectation-Maximization (EM) algorithm [16|| is used for parameter esti- 
mation and the intrinsic dimension of each group is determined automatically 
thanks to the BIC criterion and the scree-test of Cattell. This allows to derive 
a robust clustering method in high-dimensional spaces, called High Dimensional 
Data Clustering (HDDC). In order to further limit the number of parameters, it 
is possible to make additional assumptions on the model. For example, it can 
be assumed that classes are spherical in their subspaces or fix some parameters 
to be common between classes. The nature of the proposed parametrization al- 
lows HDDC to be robust with respect to the ill-conditioning or the singularity 
of empirical covariance matrices and to be efficient in terms of computing time. 
Finally, HDDC is evaluated and compared to standard clustering methods on 
artificial and real datasets. 

This paper is organized as follows. Section [2] presents the state of the art on 
clustering of high-dimensional data. Section [3] introduces our parameterization 
of the Gaussian mixture model. Section [4] presents the clustering method HDDC, 
i.e. the estimation of the parameters of the models and of the hyper-parameters. 
Experimental results on simulated and real datasets are reported in Sectional 

2 Related work on high-dimensional clustering 

Standard methods to overcome the curse of dimensionality consist in reduc- 
ing the dimension of the data and/or to use a parsimonious Gaussian mixture 
model. More recently, methods which find clusters in different subspaces have 
been proposed. In this section, a brief survey of these works in clustering of 
high-dimensional data is presented. 



2.1 Dimension reduction 



Many methods use global dimension reduction techniques to overcome problems 
due to high dimensionality. A widely used solution is to reduce the dimension 
of data before using a classical clustering method. Dimension reduction tech- 
niques can be divided into techniques for feature extraction and feature selection. 
Feature extraction techniques build new variables carrying a large part of the 
global information. Among these techniques, the most popular one is Principal 
Component Analysis (PCA) 



271 ] which is often used in data mining and image 



analysis. However, PCA is a linear technique, i.e. it only takes into account lin- 
ear dependences between variables. Recently, many non-linear techniques have 



been proposed such as Kernel PCA j4Qfl, non 



inear PCA 
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25j and neural net- 



works based techniques JJa, [28|, [39|, [44] . In 4l|, the dimension reduction problem 
was considered in the Quadratic Discriminant Analysis framework. In contrast, 
feature selection techniques find an appropriate subset of the original variables to 
represent the data. A survey on feature selection can be found in [24J. A recent 
approach 38fl proposes to combine global feature selection and model-based clus- 
tering. These global dimension reduction techniques are often advantageous in 
terms of performance, but suffer from the drawback of losing information which 
could be discriminant. Indeed, the clusters are usually hidden in different sub- 
spaces of the original feature space and a global approach cannot capture this. 



2.2 Parsimonious models 



Another solution is to use models which require the estimation of fewer parame- 
ters. For example, the eigenvalue decomposition of the covariance matrices [2l.ll3| 
allows to re-parameterize the covariance matrix of the classes in their eigenspaces. 
By fixing some parameters to be common between classes, this parameteriza- 
tion yields parsimonious models which generate clustering methods based on the 
EM algorithm. A review on parsimonious models can be found in [22|. These 
approaches are based on various Gaussian models from the most complex one 
(a full covariance matrix for each group) to the simplest one (a spherical co- 
variance matrix for all groups) which yields a method similar to the k-means 
approach. However, these methods cannot efficiently solve the problem of the 
high-dimensionality when clusters live in low-dimensional subspaces. 



2.3 Subspace clustering 

Subspace clustering methods involve two kinds of approaches. On the one hand, 
projection pursuit clustering assumes that the class centers are located on a same 
unknown subspace [3, [L4J. On the other hand, principal component clustering 
assumes that each class is located on a unknown specific subspace, see [8||, Chap- 
ter 17, and |J| for an extension to fuzzy subspaces. For instance, the Analyse fac- 



torielle typologique 18] is based on an iterative algorithm similar to the k-means 
approach. Some subspace clustering methods use heuristic search techniques to 
find the subspaces, see for instance [l|. A review on this type of methods can 
be found in [34|]. Most of them rely on geometric considerations and are not 
model-based. Regression clustering methods (sometimes called switching regres- 
sion methods) offer an alternative based on probabilistic models. Some examples 
are [121, |32fl while the original idea is due to [21]. However, it has been observed 
that discarding some dimensions may yield instabilities in presence of outliers or 
on small datasets. For this reason, the method proposed in this paper does not 
assume that there exist irrelevant dimensions and therefore does not discard any 
dimensions, but it models the smallest variances by a single parameter. Methods 



based on mixtures of factor analyzers [33|, |45|] rely on a latent variables model 



and on an EM based procedure to cluster high-dimensional data. More recently, 
Bocci et al. [6|| proposed a similar approach to cluster dissimilarity data. The 
model of these methods is a mixture of Gaussian densities where the number of 
parameters is controlled through the dimension of the latent factor space. The 
advantage of such a model is to capture correlations without estimating full co- 
variance matrices and without dimension truncation. In this paper, we propose 
an unified approach for subspace clustering in the Gaussian mixture model frame- 
work which encompasses these approaches and involves additional regularizations 
as in parsimonious models. A precise comparison between our approach and the 
mixtures of factor analyzers is achieved in paragraph 13.21 

3 A Gaussian model for high-dimensional data 

Clustering divides a given dataset {x±, ...,x n } of n data points in W into k ho- 



mogeneous groups (see [26|| for a review). A popular clustering technique uses 



Gaussian mixture models, which assume that each class is represented by a Gaus- 
sian probability density. Data are therefore modeled by a density: 



f(x,0) = ^TT^O,^ 



(1) 



i=l 



where is a p-variate normal density with parameter di = {/ij, Xj} and 7Tj are the 
mixing proportions. This model requires to estimate full covariance matrices and 
therefore the number of parameters increases with the square of the dimension. 
However, due to the empty space phenomenon [43fl it can be assumed that high- 
dimensional data live around subspaces with a dimension lower than the one of the 
original space. We therefore introduce low-dimensional class-specific subspaces 
in order to limit the number of parameters to estimate. 



3.1 The Gaussian model [dijbiQidi] 

As in the classical Gaussian mixture model framework, we assume that class con- 
ditional densities are Gaussian J\f p (/J,i, Xj) with means /ij and covariance matrices 
£j, for i = 1, ..., k. Let Qi be the orthogonal matrix with the eigenvectors of E, 
as columns. The class conditional covariance matrix Aj is therefore defined in 
the eigenspace of S, by: 

^ i = Q\T. i Q i . (2) 

The matrix Aj is thus a diagonal matrix which contains the eigenvalues of Sj. It 
is further assumed that Aj is divided into two blocks: 
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(P ~ di) 



with dij > b i7 j = 1, ..., di, and where di 6 {1, . . . ,p — 1} is unknown. The class 
specific subspace Ej is defined as the affine space spanned by the di eigenvectors 
associated to the eigenvalues a^ and such that /ij G Ej. Similarly, the affine 
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Figure 1: The subspaces Ej and Kf- of the ith mixture component. 



subspace Kf- is such that Ej © E^- = MP and /i; £ E, 1 . In this subspace E^ 1 , the 
variance is modeled by the single parameter b\. Let Pi(x) = QiQi (x—fJ'i)+fJ>i and 
Pi~{x) = QiQl(x—fJ,i)+Hi be the projection of a; on Ej and Kf- respectively, where 
Qi is made of the di first columns of Qi supplemented by (p — di) zero columns 
and Qi = (Qi — Qi). Thus, Ej is called the specific subspace of the ith. group 
since most of the data live on or near this subspace. In addition, the dimension 
di of the subspace Ej can be considered as the instrinsic dimension of the ith 
group, i.e. the number of dimensions required to describe the main features of 
this group. Figure Q] summarizes these notations. Following the notation system 
of 



13l |. our mixture model is denoted by [a^biQidi] in the sequel. 



3.2 The sub-models of [a^biQidi] 



By fixing some parameters to be common within or between classes, we obtain 
particular models which correspond to different regularizations. In the following, 
"free Q«" means that Qi is specific for each class Cj and "common Qi" means 
that for each i — 1, ...,k, Qi = Q and consequently the class orientations are the 
same. The family [aijbiQidi] is divided into three categories: models with free 
orientations, common orientations and common covariance matrices. 



Models with free orientations They assume that the groups live in subspaces 

with different orientations, i.e. the matrices Q; are specific to each group. Clearly, 
the general model [ay&jQjdJ belongs to this category. Fixing the dimensions di 
to be common between the classes yields the model [a^biQid] which corresponds 



to the model of 
rewritten as S, = 



451 ]. Indeed, the covariance model given by ([2]) and (J3J) can be 

QiTi and where we have defined 



BiBj+ Di with Di 



Oi-Lp, J3i 



T, 




(P ~ di) 



As a consequence, our approach encompasses the mixtures of probabilistic prin- 
cipal component analysis introduced in [45|| and extended in [33J] to more general 
matrices D^. In our model, di, the number of columns of Tj, depends on the class. 
This permits the modeling of a dependence between the number of factors and the 
class. Moreover, as illustrated in paragraph 13.21 our approach can be combined 
with a "parsimonious models" strategy to further limit the number of parameters 
to estimate. Up to our knowledge, this has not been achieved yet in the mixture 
of factor analyzers model. For instance, if we further assume that di = (p — 1) 
for all i = l,...,k, the model [a^biQidi] reduces to the classical GMM with full 
covariance matrices for each mixture component which yields in the supervised 
framework the well known Quadratic Discriminant Analysis. It is possible to add 
constraints on the different parameters to obtain more regularized models. Fix- 
ing the first di eigenvalues to be common within each class, we obtain the more 
restricted model [dibiQidi] . The model [dibiQidi] often gives satisfying results, 
i.e. the assumption that each matrix Aj contains only two different eigenvalues, 
aj and bi, seems to be an efficient way to regularize the estimation of Aj. An- 
other type of regularization is to fix the parameters bi to be common between the 
classes. This yields the model [dibQidi] which assumes that the variance outside 
the class-specific subspaces is common. This can be viewed as modeling the noise 
in Kj- by a single parameter b which is natural when the data are obtained in a 
common acquisition process. This category of models contains also the models 
[abiQidi], [abQidt] and all models with free Qi and common di. 



Models with common orientations It is also possible to assume that the 
class orientations are common, i.e. Qi = Q for each i = l,...,k. However, this 
assumption does not necessarily imply that the class-specific subspaces are the 
same. Indeed, if the dimensions di are free, the intersection of the k class-specific 
subspaces is the one of the class with the smallest intrinsic dimension. This 
assumption can be interesting to model groups with some common properties and 
with additional specific characteristics. Several models of this category require a 
complex iterative estimation based on the FG algorithm [20| and therefore they 
will be not considered here. Consequently, only the models [aibiQd], [abiQd] and 
[aibQd] will be considered in this paper since their parameters can be estimated 
using a simple iterative procedure. Note that a model similar to [dijbQd] was 



considered by Flury et al. in [2l| in the supervised framework with an additional 
assumption on the means. 

Models with common covariance matrices This branch of the family only 
includes two models [djbQd] and [abQd]. Both models indeed assume that the 
classes have the same covariance matrix £ = QAQ 1 . Particularly, fixing d = 
(p — 1), the model [djbQd] reduces to a Gaussian mixture model (denoted by 
"Com-GMM" in the following) which yields in the supervised framework the well 
known Linear Discriminant Analysis (LDA). Remark that if d < (p—l), the model 
[djbQd] can be viewed as the a combination of a dimension reduction technique 
with a GMM with common covariance matrices, but without losing information 
since the information carried by the smallest eigenvalues is not discarded. 

3.3 Characteristics of the models 

Our family of models presented above only requires the estimation of <ij-dimensional 
subspaces and therefore the different models are significantly more parsimonious 
than the general Gaussian model if di <C p. Table Q] summarizes some properties 
of the models considered here. The second column of this table gives the number 
of parameters to estimate. The third column provides the asymptotic order of the 
number of parameters (i.e. with the assumption that k <C d <C p). The fourth 
column gives the number of parameters for the particular case k = 4, p = 100 and 
Vi, di = 10. The last column indicates whether the Maximum Likelihood (ML) 
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Table 1: Properties of the HDDC models: p = kp + k — 1 is the number of 
parameters required for the estimation of means and proportions, f = ^2 i=1 di\p— 
(di + l)/2] and r = d[p — (d + l)/2] are the number of parameters required for 
the estimation of Qi and Q, and D = J2i=i di- For asymptotic orders, we assume 
that k <^ d <^ p. CF means that the ML estimates are closed form. IP means 
that the ML estimation needs an iterative procedure. FG means that the ML 
estimation requires the iterative FG algorithm. 



updates are in closed form or not. These characteristics are also given for five 
Gaussian mixture models: GMM with full covariance matrices for each class 
(Full-GMM), with common covariance matrices between classes (Com-GMM), 
with diagonal covariance matrices (Diag-GMM), with spherical covariance ma- 
trices (Sphe-GMM). Note that Celeux and Govaert recommend in (l3j to make 
use of the models Diag-GMM and Sphe-GMM in clustering problems. We can 
observe that all models of our family require the estimation of fewer parameters 
than both Full-GMM and Com-GMM. In the particular case of 100-dimensional 
data, made of 4 classes and with common intrinsic dimensions di equal to 10, the 
model [dijbiQidi] only requires the estimation of 4 231 parameters whereas Full- 
GMM and Com-GMM requires respectively the estimation of 20 603 and 5 453 
parameters. Remark that the model [dijbiQidi], which gives rise to quadratic 
separation between the groups, requires the estimation of fewer parameters than 
Com-GMM, which gives rise to linear separation between the groups. 



4 High-dimensional data clustering 

In this section, we derive the EM-based clustering framework for the model 
[dijbiQidi] and its sub-models. The related clustering method is denoted by High- 
Dimensional Data Clustering (HDDC). Let us recall that unsupervised classifi- 
cation organizes data in homogeneous groups using only the observed values of 
the p explanatory variables. Usually, in model-based clustering, the parameters 
9 = {ivi, ..., ii k , $i, ..., 9 k } with 9i = {/ij,Sj} are estimated by the EM algorithm 



which repeats iteratively E and M steps. The reader could refer to [3jj for fur- 
ther informations on the EM algorithm and its extensions. In particular, the 
models presented in this paper can be also used in the Classification EM and 



Stochastic EM algorithms [l2j. Using our parameterization, the EM algorithm 



for estimating 9 = {7Tj, /ij, Ej, ay, b iy Q iy di} is detailed in the following. 



4.1 The E step 



This step computes, at iteration q and for each % — 1, ..., k and j = 1, ...,n, the 

Vj G Cj \Xj / 



conditional probability t\f = P(x, G C\ q \Xj) which can be written from 



10 



and using the Bayes formula as follows: 

Note that this conditional probability is mainly based on tx] 1 4>(xj, &f ). and 
thus can be rewritten using the parameters of the model [ciijbiQidi]. In order 
not to overload the equations, the index of the current iteration q is omitted in 
the remainder of this paragraph. Writing cj)(x, 6i) with the new class conditional 
covariance matrix A,-, we obtain: 



nj 



-2 log(0(x, 6i)) = (x- fiiYiQiAiQiy^x - fa) + log(det A*) + plog(27r). 

Since Q\Qi = I p and Qi = Qi + Qi, the above matrix inverse can be expanded as 
(Q l A i Qly 1 = QA^Ql + Q l Ar l Q\ and thus: 

-2 log(0(x, 6$) = (x - fi i ) t Q l A- 1 Q t i (x - fa) + (x- fi i ) t Q l Ar 1 Q t l (x - m) 
+ log(det A,) + plog(27r). 

Taking into account the structure of Aj and using the relations Qi{Q\Qi) = Qi 
and Qi {Q\Qi) = Qi, it yields: 

-21og(0(x,^)) = ||4OK^-/^OIlA + ^IIQ4-(x-^)|| 2 + log(detA i )+plog(27r), 

where ||.||^. is the norm on E$ such as \\x\\ 2 At = x l AiX with Ai = QiA~ x Qi . From 
the definitions of Pi and P^~ (Paragraph 13. II) and in view of Figured], we have: 

-2 log(0(z, ft)) = \\fii - Pi(x)\\ \ + -\\x-Pi(x) || 2 + log(detA i )+plog(27r). 

"i 

The relation log(det Aj) = J2/=i l°g( a y) + ip~ di) l°g(^) allows to conclude that: 
tij = 1 / J^exp (-(Ki(xj) - K t {xj))\ , 
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where K^x) = — 21og(7Tj0(x, #,)) is called the cost function and is defined by: 

i di 

Ki{x) = ||/i 4 -P i (x)||^ + -||x-P l (a;)|| 2 + ^log(a 4J ) + (p-rf i )log(6 4 )-21og(7r 4 ). 

Let us note that Ki(x) is mainly based on two distances: the distance between 
the projection of x on E$ and the mean of the class and the distance between 
the observation and the subspace Ej. This cost function favors the assignment 
of a new observation to the class for which it is close to the subspace and for 
which its projection on the class subspace is close to the mean of the class. The 
variance terms ay and bi balance the importance of both distances. For example, 
if the data are very noisy, i.e. bi is large, it is natural to balance the distance 
\\x — Pi(x)\\ 2 by 1/bi in order to take into account the large variance in E^ 1 . 

4.2 The M step 

This step maximizes at iteration q the conditional likelihood and uses the follow- 
ing update formulas. Mixture proportions and means are estimated by: 



n ' ^ -to) 



where nf = Y^j=i Aj ■ Moreover, the update formula for the empirical covariance 



j 3 =\ "ij 
matrix of the fuzzy class C{ is: 

n 

The estimation of the specific parameters of HDDC is detailed below. Proofs of 
the following results are given in the Appendix. 

Models with free orientations The ML estimators of model parameters are 
closed form for this category of models. 

- Subspace Ej: the di first columns of Qi are estimated by the eigenvectors 
associated with the di largest eigenvalues \j of Wi. 
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- Model [ctijbiQidi]: the estimator of a^ is a y - = A^ and the estimator of bi is the 
mean of the (p — di) smallest eigenvalues of Wi and can be written as follows: 

- Model [ciijbQidi]: the estimator of a^- is a y - = A^ and the estimator of b is: 

&=^(^w -£<£>). ( 5 ) 

where £ = £) i=1 7Tirfi and W = Yli=i ^iWi is the within-covariance matrix. 

- Model [ciibiQidi]: the estimator of bi is given by (jlj) and the estimator of a, is: 

di = -J2 X iJ- ( 6 ) 

^ 3=1 

- Model [abiQidi]: the estimator of bi is given by (TJJ and the estimator of a is: 

1 k di 

? t=i j=i 

- Model [aj6Qj<ij]: estimators of a« and 6 are respectively given by (jHJ) and (jSJ). 

- Model [aftQidj]: estimators of a and 6 are respectively given by (jZj) and ((SJ). 

- Models with common dimensions: the estimators of the models with common 
dimensions d, can be obtained from the previous ones by replacing the values di 
by d for each i = 1, ..., fc. In this case, equations (JHJ) and (JJJ) can be simplified as: 



1 d 

a = iE A ^ ( 8 ) 



i=i 



^H-M 



where Aj is the jth largest eigenvalue of W. 

- Model [ctjbiQid]: the estimator of a,j is dj = Aj and the estimator of bi is 
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- Model [djbQid]: the estimator of a, is dj = Xj and the estimator of b is (J9j) . 

Models with common orientations Here, we assume in addition that the 
dimensions di are common between classes. The following ML estimators require 
an iterative procedure. 

- Subspace E^: Given a* and bi, the d first columns of Q are estimated by the 
eigenvectors associated to the d largest eigenvalues of the matrix M defined by: 

k 1 1 
M(ai,...,a k ,bi,...,b k ) = /ni(- )W*. 

4 = 1 

- Model [aj6jQ<i]: given Q, estimators of a, and bi are: 

1 d 



d . 



W) = jj^jj W - E^9ij • (11) 

- Model [ai6Qidi]: given Q, the estimator of a, is (TTOjl and the estimator of 6 is: 

6(g) = ^y (rr(W) - J2 QjWqA . (12) 

- Model [abiQd]: given Q, the estimator of fej is (fTTT) and the estimator of a is: 

1 d 
d(Q) = -J2^ W( lr (I 3 ) 

i=i 

- Model [aj6Q<i]: given Q, estimators of a,i and 6 are respectively (fTOl) and (fl~2l) . 

For example, it is possible to use the following iterative procedure to estimate 
the parameters associated to the model [aibiQd]: 

- Initialization: the d first columns of Q^ are the eigenvectors associated with 
the d largest eigenvalues of W. 

- Until convergence: a\ = d;(Q^ -1 )), b\ = bi{Q^~ 1 ^) and the d first columns 
of Q™> are the eigenvectors associated to the d largest eigenvalues of the matrix 

14 
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Figure 2: Estimation of the intrinsic dimension di using the scree-test of Cat- 
tell: plot of ordered eigenvalues of Ej (left) and differences between consecutive 
eigenvalues (right). 

M(a(«,...,a?,6f>,...,&'«). 



Models with common covariance matrices In this category of models, the 
parameters can be updated in closed form. 

- Subspace E$: the d first columns of the matrix Q are the eigenvectors associated 
to the d largest eigenvalues of W. 

- Model [cLjbQd]: the estimator of Oj is dj = Xj and the estimator of b is ©. 

- Model [abQd] : estimators of a and b are respectively given by (ED and (jHJ) • 



4.3 Hyper-parameters estimation 

Within the M step, the intrinsic dimensions of each subclass have to be estimated. 
This is a difficult problem with no unique technique to use. Our approach is based 
on the eigenvalues of the class conditional covariance matrix Ej of the class Cj. 
The jth eigenvalue of E; corresponds to the fraction of the full variance carried by 
the jth eigenvector of Ej. The class specific dimension di, i — 1, ..., k is estimated 
through the scree-test of Cattell [ill] which looks for a break in the eigenvalues 
scree. The selected dimension is the one for which the subsequent eigenvalues dif- 
ferences are smaller than a threshold. Figure [2] illustrates this method: the graph 
on the right shows that the differences between eigenvalues after the fourth one 
are smaller than the threshold (dashed line). Thus, in this case, four dimensions 
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will be chosen and this corresponds indeed to a break in the scree (left graph). In 



our experiments, the threshold is chosen using the probabilistic criterion BIC [42] 
which consists in minimizing BIC(m) = — 21og(L) + vim) log(n), where vim) is 
the number of parameters of the model m given in Tabled] for HDDC, L is the 
likelihood and n is the number of observations. In addition, this approach al- 
lows to estimate k parameters by choosing only the value of the threshold t. In 
the case of common intrinsic dimensions between the groups, the dimension d is 
directly determined using BIC. The second hyper-parameter to estimate in any 
clustering method is the number of groups k. This parameter is also selected 
thanks to the BIC criterion, see the experiments presented in Section [5] 

4.4 Numerical considerations 

First, it is important to remark that the parametrization of the Gaussian model 
proposed here provides an explicit expression of S" 1 whereas other classical meth- 
ods, like Full-GMM and Com-GMM, need to numerically invert empirical covari- 
ance matrices which usually fails for singularity reasons. Some solutions however 
exist to overcome this problem for the models Full-GMM and Com-GMM, see for 



instance [29|. In contrast, this problem does not arise with HDDC since the cost 
function K t does not require to invert Ej. Moreover, it appears in (14. ip that the 
cost function Ki does not use the projection on the subspace Kf- and consequently 
does not require the computation of the (p — di) latest columns of the orientation 
matrix Qi. In Section l4~2l it is shown that the ML estimators of these columns 
are the eigenvectors associated to the (p — di) smallest eigenvalues of the empirical 
covariance matrix Wi. Therefore, HDDC does not depend on these eigenvectors 
whose determination is numerically unstable. Thus, HDDC is robust with re- 
spect to ill-conditioning and singularity problems. In addition, it is also possible 



to use this feature to reduce computing time by using the Arnoldi method [30] 
which only provides the largest eigenvalues and the associated eigenvectors of 
an ill-conditioned matrix. During our experiments, we noticed a reduction by 
a factor 60 of the computing time on a 1024-dimensional dataset compared to 
the classical approach. Furthermore, in the special case where the number of 
observations of a group n^ is smaller than the dimension p, our parametrization 
allows to use a linear algebra trick. Indeed, in this case, it is better from a nu- 
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Simulated 






HDDC i 


no del 






data model 


[dijbiQidi] 


[a^bQidi] 


[aibiQidi] 


[dibQidi] 


[abiQidi] 


[abQidi] 


[dijbiQidi] 


357 


373 


349 


359 


349 


360 


[dijbQidi] 


403 


404 


397 


396 


397 


397 


[aibiQidi] 


389 


419 


377 


391 


377 


394 


[dibQidi] 


438 


440 


419 


419 


420 


420 


[abiQidi] 


399 


433 


380 


402 


384 


403 


[abQidi] 


456 


451 


428 


427 


434 


433 



Table 2: BIC value for the HDDC models on different simulated datasets (the 
best ones are in bold). 



merical point of view to compute the eigenvectors of the rii x m matrix TjT* 
than those of the p x p matrix T*Yj, where T, is the rii x p matrix containing 
the mean-centered observations. In the case of data containing 13 observations 
in a 1024-dimensional space, it has been noticed a reduction by a factor 500 of 
the computing time compared to the classical approach. 



5 Experimental results 

In this section, we present results for artificial and real datasets illustrating the 
main features of HDDC. In the following experiments, HDDC will be compared 
to 3 classical Gaussian mixture models: GMM with full covariance matrices for 
each class (Full- GMM), with diagonal covariance matrices (Diag-GMM), with 
spherical covariance matrices (Sphe-GMM). A numerical regularization was nec- 
essary to invert the covariance matrices in the clustering method associated to 
the model Full-GMM, so that it is able to work with data of dimension larger 
than 50. 

5.1 Simulation study: model selection 

Given that HDDC is a model-based clustering method, the well-known criterion 
BIC can be used for selecting the best adapted model to the data. Here, we used 
BIC and the cluster recognition rate to compare the different models of HDDC. 
The cluster recognition rate can be computed since true partitions are known 
and is defined as the maximum rate over the correct matchings between the true 
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Simulated 






HDDC i 


no del 






data model 


[dijbiQidi] 


[a^bQidi] 


[aibiQidi] 


[dibQidi] 


[abiQidi] 


[abQidi] 


[dijbiQidi] 


0.967 


0.828 


0.973 


0.919 


0.975 


0.903 


[dijbQidi] 


0.730 


0.727 


0.779 


0.782 


0.758 


0.751 


[aibiQidi] 


0.979 


0.871 


0.983 


0.929 


0.986 


0.917 


[dibQidi] 


0.826 


0.800 


0.882 


0.863 


0.875 


0.865 


[abiQidi] 


0.965 


0.825 


0.980 


0.844 


0.952 


0.822 


[abQidi] 


0.712 


0.752 


0.797 


0.793 


0.711 


0.707 



Table 3: Cluster recognition rate for the HDDC models on different simulated 
datasets (the best ones are in bold). 

groups and the found clusters. It is impossible to report in this section numerical 
experiments for all the discussed models. Therefore, we limit ourselves to models 
with free orientations since we believe that these models are able to tackle different 
situations. We performed extensive simulations (50 replications for each of the 
6 data models) and then used the 6 different models with free orientations in 
HDDC to cluster the simulated data. For each dataset, 3 Gaussian densities are 
simulated in M 100 according to one of the 6 models with free orientations, i.e. 
free matrices Qj, and with the following parameters: {di,d 2 ,d 3 } = {2,5,10}, 
{tci, 7t 2 , 7r 3 } = {0.4,0.3,0.3} and close means and random matrices Qi. Each one 
of the 6 datasets was made of 1000 points. Tables [2] and [3] present respectively 
the BIC value and the cluster recognition rate on average for the 6 considered 
HDDC models on the different simulated datasets. First of all, it appears that 
BIC and the cluster recognition rate select in general the same models and this 
confirm that BIC is a useful tool in model-based clustering. Unsurprisingly, the 
models used to simulate the data obtain small BIC values and satisfying cluster 
recognition rates. However, it appears that the model [aibiQidi] is usually selected 
by BIC as the best model and its cluster recognition rates are very good for each 
type of simulated data. Thus, the model [aibiQidi] seems to have the right number 
of degrees of freedom and the assumption that Aj has only 2 different eigenvalues 
is an efficient way to regularize the estimation. Note that models [afiQidi] and 
[abiQidi] are also often selected by BIC and provide good cluster recognition rates. 
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Nb of groups k 


Dimensions d{ 


BIC value 


2 


2,16 


414 


3 


2,5,10 


407 


4 


2,2,5,10 


414 


5 


2,5,5,10,12 


416 


6 


2,5,6,10,10,12 


424 



Table 4: Selection of the number of groups using BIC with the model [aibiQidi] 
of HDDC: data are made of 3 groups with intrinsic dimensions di = {2, 5, 10}. 

5.2 Simulation study: hyper-parameters selection 

Here, we are interested in the selection of the number of groups and of the 
intrinsic dimension of the clusters. In this experiment, 3 Gaussian densities 
are simulated in R 100 according to the model [aibiQidi] with the following pa- 
rameters: {di, g?2, d 3 } = {2,5,10}, {vti, 7r 2 , 7r 3 } = {0.4,0.3,0.3}, {01,02,03} = 
{150,100,75}, {61, i» 2 , 63} = {15,15,15}, close means and random matrices Qi. 
The dataset was made of 1000 points. Table [4] presents the choices of group in- 
trinsic dimensions for the different values of k and the corresponding BIC values. 
First of all, it appears that the criterion BIC can be successfully used for choosing 
the number of clusters as in standard Gaussian mixture models. Indeed, the BIC 
value associated to the model [aibiQidi] are computed for different values of k, the 
number of groups, and BIC indicates that the most likely value is k = 3 which is 
correct. In addition, the intrinsic dimensions di, estimated by HDDC for k = 3, 
are indeed the ones of the simulated data. It is also interesting to observe the 
evolution of the estimation of dimensions di according to the number of clusters. 
For instance, if we consider the case of a mixture of 2 Gaussian densities, HDDC 
seems to correctly fit the first 2-dimensional cluster and create a second cluster 
made of the two other real groups. In addition, the estimated dimension of this 
second cluster is approximately the sum of the intrinsic dimensions of the two 
real groups. Similarly, for k = 4, HDDC divides the first real group into two new 
clusters with intrinsic dimensions equal to 2. As a conclusion, our approach for 
dimension estimation allows to correctly identify the cluster subspaces. 
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Figure 3: Influence of the dimensionality on the BIC value for different Gaussian 
mixture models. 
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Figure 4: Influence of the dimensionality on cluster recognition rate for different 
Gaussian mixture models. 
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5.3 Simulation study: influence of the dimensionality 

In this paragraph, we highlight the dimensionality effect on the different clus- 
tering methods. Three Gaussian densities are simulated in MP, p = 20, ...,100, 
according to the model [aibiQidi] with the same parameters as in the previous 
experiment. The performance of methods is measured by the average cluster 
recognition rate computed on 50 replications. The studied clustering methods 
were initialized using the same random partition. Figures [3] and 0] respectively 
show the influence of the dimensionality on the BIC value and the cluster recog- 
nition rate for different Gaussian mixture models: model [aibiQidi] of HDDC, 
Full-GMM, Diag-GMM and Sphe-GMM. It is not surprising to observe on Fig- 
ure [3] that BIC selects the model [aibiQidi] as the best model since the data are 
simulated according to this model. However, it interesting to remark that, the 
more the dimension increases, the larger the difference between the BIC values 
of the different models is, and that in favor of the model [aibiQidi]. Figured] 
shows that data dimension does not influence the performance of HDDC which 
is very close to the performance of the Bayes decision rule (computed with the 
true densities). In addition, HDDC provides a cluster recognition rate similar to 
Full-GMM in low dimensions. Full-GMM is known to be very sensitive to the 
data dimension and, indeed, gives bad results as soon as the dimension increases. 
The models Diag-GMM and Sphe-GMM cannot correctly fit the data since they 
are too parsimonious for this complex dataset. However, one can observe that 
Sphe-GMM is not sensitive to the data dimension whereas Diag-GMM is. To 
summarize, HDDC is not sensitive to the dimension and works very well both 
in low and in high-dimensional spaces. In addition, the model [aibiQidi] outper- 
forms models requiring a higher number of parameters (Full-GMM) and models 
requiring a smaller number of parameters (Diag-GMM and Sphe-GMM). 

5.4 Simulation study: full rank Gaussian model 

In this last simulation study, the capacity of HDDC models to deal with full rank 
Gaussian data is investigated. Three Gaussian densities in M p , p = 50, are simu- 
lated with full rank covariance matrices, i.e. according to the model Full-GMM. 
The covariance matrices of the groups were different (different orientations and 
eigenvalues) but with the same condition number fixed to 100. Recall that the 
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Figure 5: Influence of the dataset size on the condition number for the full rank 
data. 
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Figure 6: Influence of the dataset size on the cluster recognition rate for the full 
rank data. 
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condition number of a matrix is the ratio of its largest and smallest eigenvalues. 
For this experiment, we used HDDC with the model [dijbiQidi] and the clustering 
methods associated to the classical Gaussian models Full-GMM, Diag-GMM and 
Sphe-GMM. In order to observe the behavior of the studied clustering methods 
with respect to the curse of the dimensionality, the cluster recognition rate is 
computed for different dataset sizes n since this phenomenon occurs when the 
size of the dataset becomes too small compared to the dimension. As an illus- 
tration, Figure [5] presents a comparison between the condition number of the 
estimated covariance matrix associated to the first group used by the Full-GMM 
method and the ratio dn/bi, which is the corresponding condition number of the 
covariance matrix estimated by HDDC, for different sizes of the full rank dataset 
n = 150, ...,2000. It appears that, for small datasets (i.e. n smaller than 1000), 
the condition number of the empirical covariance matrix (associated to the model 
Full-GMM) explodes, whereas the condition number associated to the estimated 
covariance matrix in the model [dijbiQidi] remains stable. Figure [6] shows the 
consequence on the behavior of the studied clustering methods. First, observe 
that both Diag-GMM and Sphe-GMM models do not obtain satisfying results 
for any dataset size. This is due to the fact that the assumptions made by those 
models are wrong for the simulated data and they are thus not able to correctly 
fit these data. Second, HDDC obtains a similar cluster recognition rate to the 
model Full-GMM, which is the model used to simulate the data, when the dataset 
size is large (i.e. n larger than 1500). Furthermore, HDDC appears to be more 
efficient to cluster these data than the model Full-GMM when the dataset size 
becomes small. Indeed, the cluster recognition rate of HDDC is almost constant 
for dataset sizes between 1500 and 500. However, when the dataset size is smaller 
than 500, the HDDC performance decreases to the results obtained by the par- 
simonious models Diag-GMM and Sphe-GMM. These experiments demonstrate 
that, even with data which are not favorable to our model, HDDC is more efficient 
than both complex and parsimonious models on small datasets. 

5.5 Real data study: comparison with variable selection 

In this experiment, HDDC is compared with the variable selection method for 



model-based clustering introduced in [38||, and denoted by VS-GMM in the follow- 
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Model 


Variables 


Cluster 


recognition rate 


Sphe-GMM 


Original 




0.605 


VS-GMM 


Original 




0.925 


Sphe-GMM 


Princ. comp. 




0.605 


VS-GMM 


Princ. comp. 




0.935 


HDDC [aAQA} 


Original 




0.950 



Table 5: Classification results for the Crabs data: comparison of different model- 
based clustering methods. 
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Figure 7: Clustering results using HDDC: on the left panel, crabs data projected 
on the two first principal axes and, on the right panel, clustering result obtained 
with the model [aAQi^i] of HDDC and the estimated specific subspaces of the 
mixture components (blue lines). 
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ing. The authors considered the variable selection problem as a model selection 
problem. Selection is made using approximate Bayes factors and combined with 
a greedy search algorithm. In addition, it is possible to perform this variable 
selection on the original variables, but also on the principal components using 
PCA as a pre-processing step. In order to compare HDDC to this variable se- 
lection technique, we used the same dataset as in [38fl. The Leptograpsus crabs 
dataset consists of 200 subjects equally distributed into 4 classes: Orange Male, 
Orange Female, Blue Male and Blue Female. There are 5 variables for each sub- 
ject: width of frontal lip (FL), rear width (RW), length along the mid-line of the 
carapace (CL), maximum of the width of the carapace (CW) and body depth 
(BD) in mm. The left panel of Figure [7| shows the Crabs data projected on the 
two first principal axes and the big circles represent the cluster means. 

Table [5] gives the classification error rate for the classical model Sphe-GMM, 
the VS-GMM method and HDDC. The second column of this table indicates on 
which variables is performed the clustering. HDDC obtains a cluster recognition 
rate equal to 95% and the variable selection method of Raftery et al. obtains 
93.5% whereas the classical model Sphe-GMM obtains a cluster recognition rate 
equal to 60.5%. HDDC found that each cluster lives in a 1-dimensional subspace 
embedded into the original 5-dimensional space. The right panel of Figure [7] 
shows the specific subspaces (blue lines) of the 4 mixture components obtained 
with the model [aibiQidi] of HDDC. For this illustration, the data is projected on 
the two first principal components since results obtained with VS-GMM on these 
variables are better than on the original ones. It can be observed that the specific 
axes of the different clusters are very correlated and this explains that HDDC 
provides a better clustering result than the variable selection method VS-GMM. 

5.6 Real data study: Martian surface characterization 

Here, we propose to use HDDC to analyze and segment images of the Martian 
surface. Visible and near infrared imaging spectroscopy is a key remote sensing 
technique to study and monitor the system of the planets. Imaging spectrometers, 
which are inboard of an increasing number of satellites, provide high-dimensional 
hyper-spectral images. In March 2004, the OMEGA instrument (Mars Express, 
ESA) [5j] has collected 310 Gbytes of raw images. The OMEGA imaging spec- 
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Figure 8: Characterization of the Martian surface composition using HDDC: on 
the left, image of the studied zone and, on the right, segmentation using HDDC 
on the 256-dimensional spectral data associated to the image. 
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Figure 9: Spectral means of the 5 mineralogical classes found using HDDC. 
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trometer has mapped the Martian surface with a spatial resolution varying be- 
tween 300 to 3000 meters depending on spacecraft altitude. It acquires for each 
resolved pixel the spectrum from 0.36 to 5.2 /im in 256 contiguous spectral chan- 
nels. OMEGA is designed to characterize the composition of surface materials, 
discriminating between various classes of silicates, hydrated minerals, oxides and 
carbonates, organic frosts and ices. For this experiment, a 300 x 128 image of 
the Martian surface is considered and a 256-dimensional spectral observation is 
associated to each of the 38 400 pixels. The image of the studied zone is pre- 
sented on the left panel of Figure According to the experts, there are k = 5 
mineralogical classes to identify. 

The right image of Figure El shows the segmentation obtained with the model 
[ciibiQidi} of HDDC. First of all, observe that the segmentation of HDDC is very 
precise on the main part of the image. The poor results of the top right part of the 
image are due to the planet curvature and could be corrected. In particular, the 
experts of the domain appreciated that our method is able to detect a mixture of 
ice and carbonate around the ice zones (clear zones of the image). Figure [9] shows 
the spectral means of the 5 classes and this allows the experts to determine the 
mineralogical and molecular composition of each class. Remind that this study 
is done without taking into account the spatial relations between the pixels of 
a image. A natural extension of this work is therefore to combine HDDC with 
the modeling of the spatial relations using, for example, hidden Markov random 
fields. This experiment demonstrates that HDDC can be efficiently used on real 
high-dimensional data and with large datasets. In addition, a main interest of 
HDDC for this application is to provide posterior probabilities that each pixel 
belongs to the classes. 

6 Conclusion 

In this paper, new Gaussian mixture models designed for high-dimensional data 
are introduced. It is assumed that the intrinsic dimension of each mixture compo- 
nent is much smaller than the one of the original space. In addition, outside the 
specific subspace of each group, the noise variance is modeled by a single parame- 
ter. Additional constraints can be imposed on the parameters within or between 
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the groups in order to obtain further regularized models. This parameterization 
in the eigenspaces of the mixture components gives rise to an EM-based cluster- 
ing method, called High-Dimensional Data Clustering (HDDC). Experiments on 
artificial and real datasets demonstrated the effectiveness of the different mod- 
els of HDDC compared to classical Gaussian mixture models. In particular, the 
model [dibiQidi] provides very satisfying results for many types of data. 
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A Appendix: parameters estimation 

First of all, we introduce the following useful formulation of the log-likelihood: 
-21og(L) = ^^(log^) + ^-ql j W i q iJ )+c st , (14) 

i=l j=l ^ 'J J 

where &,- is the jth diagonal coefficient of A$ and q^ is the jth column of Qi. We 
refer to 



191 ] for a demonstration of this result. 



A.l Models with free orientations 

Subspace E$: The log-likelihood is to be maximized under the constraint q\flij 
1, which is equivalent to finding a saddle point of the Lagrange function: 

v 
£=-21og(L)-J>,(g^-l), 

where Q^ are the Lagrange multipliers. Using the expression (fT4l) of the log 
likelihood, the gradient of C with respect to g^ is: 

Tl ■ 

V q ..C = 2-^Wiqtj - Idijqij, 



Sij 






and by multiplying this quantity on the left by qjj, we obtain 






q! j V qij £ = 0&8 ij = -!-ql j W i q ij . 

Consequently, Wiq^ = -^-^-Qy and thus gy is the eigenvector of Wi associated 
with the eigenvalue Ay = - 2 ^ JJ - = qjjWiqij. As the vectors gy are eigenvectors of 
the symmetric matrix Wi, this implies that qhqu = if j 7^ £. The log-likelihood 
can therefore be re-written as follows: 

-21og(L)=^n i (X:(log(a ii ) + ^)+ £ (bgfo) + ^) ) +c st , 

j=l \j=l ^ %3 ' ' j=di + l ^ l ' ) 

and, using the relation Xy=d,+i ^u = Tr(H/j) — Xy=i ^u') we obtain: 



-c 



*/ 



-2 log(L) = J2 n * ( E lo g(%) + (P " *) lo S^) + ^T^ + E (— " r) A « 

(15) 
Thus, minimizing —2 log(L) with respect to Ay is equivalent to minimizing the 
quantity X^=i m Ej=i(^7 - £)Ay. Since (^- - ^) < 0, Vj = l,...,d i; Ay must 
therefore be as larger as possible. Thus, the column vector gy, Vj = 1, ...,di, is 
estimated by the eigenvector associated to the jth largest eigenvalue of Wi. 



Model [aijbiQidi]: starting from equation (fT5l ). the partial derivative of —2 log(L) 
with respect to ay and 6j are: 

day \ay a y/ #&i 6 » &.- V ^ 

The condition ^f = implies that dy = Ay and the condition ^ - = 
implies that: 

ft* 



raH'-p-) 






Model [dijbQidi]: the partial derivative of — 21og(L) with respect to b is: 
dlog(L) n(p - Q if / A \ 

1=1 \ J = l / 

and the condition g b ( - = proves that: 



1 / k d i \ 



(p- 



Model [ajfojQidJ: from (fT5l ). the partial derivative of — 21og(L) with respect to 
a« is: 

dlog(L) __ ni<k _ rh >A 
<9a; Oj a. 2 4-' lJ ' 

and the condition — g|~ — implies that: 






3=1 



Model [afejQi^i]: the partial derivative of — 21og(L) with respect to a is: 

dlog(L) ni 1 * * 



t=l i=l 



and the condition — #^ = gives: 



aiog(J-) 



^ t=l j=l 

Model [cijbiQid]: the partial derivative of — 21og(L) with respect to a, is: 

o dlog(L) _ n 1_A 

<9a- ~ a- o 2 ^ r 

J J J j=l 

The condition ^ ^ = and the relation Yli=i n i^ij = n ^j imply that a,j = Xj. 
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A. 2 Models with common orientations 

Subspace Ej! Starting from the likelihood expression (fT4j ). we can write: 



k d ( 1 \ k P ( 1 \ 

-2 log(L) = X) "* Z) ( log ( ai ) + -^^i^ +E n 'E ( lo S^) + ^W^ ) + cS *' 

i=l j'=l ^ 8 ' i=l j'=d+l ^ * ' 



J2 ni (d\og( ai ) + {p-d) log(6i)) + ^ gjA^ + ^ gpg 

i=l j=l j=d+l 



, c st 



where A = £f=i gW* and 5 = £- = i f^- Note that Tf j=d +iQj B Qj can be 
written using the trace of B: Yjj=d+i QjBqj = Tr(£?) — Ylj=iQj B< ij- This yields: 

k d 

-21og(L) = ^^(dlogC^ + Cp-dJlogf^-^gjCS-A^ + ^BJ-KttJ) 

i=i i=i 

Consequently, the gradient of C = — 2 log(L) — Y^j=\ QjiQjQj ~ 1) with respect to 

Qj is: 

V, 3 £ = -2(£ - A)^ - 29 j q i , 

where 6j is the jth Lagrange multiplier. The relation V ?J £ = is equivalent to 
{B — A)qj = —QjHj which means that qj is eigenvector of the matrix (B — A). In 
order to minimize the quantity — 21og(L), the d first columns of Q must be the 
eigenvectors associated with the d largest eigenvalues of (B — A) . 

Model [a.jbiQd] : Starting from equation (TT6|) , the partial derivatives of —2 log(L) 
with respect to Oj and 6, are: 

dlog(L) riid rii A t aiog(L) njjp-d) n, ( A . 

~ 2 ^- = ^r^E<^ and - 2 ^- = — — f Tr ^ - E^* 

* j=i * \ j=i , 

The condition ^ - = and ^ - = give respectively: 

d / d \ 
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Model [a,ibQd\: The partial derivative of — 21og(L) with respect to b is: 



-2 



B^ L = Mp^_n^ m _^^ 



and the condition °^ - = implies that: 



kQ) = j^U(w)-f; q *w q ^ 



Model [abiQd]: The partial derivative of — 21og(L) with respect to a is: 

_ 2 9i|(L) = ™;_„£ 

da a a 2 *-^ J 

3=1 

and the condition °^ - = proves that: 

1 d 
i=i 



A. 3 Models with common covariance matrices 
Subspace Ej: The log-likelihood can be written as follows: 

-2 log(L) = n ( J2 log(%) + (p - d) log(6) + ^p- + £ (±- - ±) gJWfy J +c 
The gradient of £ = — 2 log(L) — X^=i QjiQjQj ~ 1) with respect to <£,• is: 

v « c = 2n{ ~h ~ \ )Wq > ~ 29jQj > 

where 9j is the jth Lagrange multiplier. The relation V g .£ = implies that q^is 
eigenvector of W. In order to minimize — 21og(L), the first columns of Q must 
be the eigenvectors associated to the d largest eigenvalues of W. 
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Model [a,jbQd\: The partial derivatives of — 21og(L) with respect to a,j and b 
are: 

dlog(L) _ n n glog(L) _ n(p - d) n A 

2 0a,- ~ a, af^ J ^ 96 ~ 6 6^ Z. *i w, h- 

J J 3 j=d+l 

The condition ^ = implies that dj = Xj. The combination of the condition 

°g^ = with the relation Y^=d+i ^i = Tr(W) — Y^j=i -\? gi yes the estimator 

of b: 



(P 



^H"P') 



Model [abQd]: The partial derivatives of — 21og(L) with respect to a is: 

dlog(L) nd n v-^ , TTr 
da a a 2 ^^ J 

and the condition ^ - = implies that: 

1 d 

;5> 



1 , 

a 



d^-' r 

3=1 
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